Statistical Functions with Missing Values

Motivation

The previous functions ignore NA values, which is not ideal for real-world data. This chapter introduces an additional argument and checks to the functions that allow the user to remove missing values (including NaN) from the vector. These examples were adapted from @vaughan.

Load the Package

I loaded the ece244 package as I added the functions from the next sections to it with the following code:

load_all()

Additional Packages

I used the bench package to compare the performance of the functions. The package was loaded with the following code:

library(bench)

Sum of Vector Elements (sum())

The next function returns the sum of the elements of a vector:

[[cpp11::register]] double sum2_cpp_(doubles x, bool na_rm = false) {
  int n = x.size();
  double total = 0;
  for (int i = 0; i < n; ++i) {
    if (na_rm && ISNAN(x[i])) {
      continue;
    } else {
      total += x[i];
    }
  }
  return total;
}

Unlike the sum function from Chapter 3, this function has an additional argument na_rm that allows the user to remove missing values (including NaN) from the vector.

The corresponding auxiliary function for documentation is:

#' Return the sum of the elements of a vector (C++)
#' @inheritParams sum_r
#' @param na_rm logical. Should missing values (including `NaN`) be removed?
#' @export
sum2_cpp <- function(x, na_rm = FALSE) {
  sum2_cpp_(as.double(x), na_rm = na_rm)
}

To test the functions, I ran the following benchmark code in the R console:

set.seed(123) # for reproducibility
x <- runif(1e3) # 1,000 elements
x[sample(1:1e3, 1e2)] <- NA # randomly insert NA values

sum(x, na.rm = FALSE)
[1] NA
sum2_cpp(x, na_rm = FALSE)
[1] NA
sum(x, na.rm = TRUE)
[1] 447.4191
sum2_cpp(x, na_rm = TRUE)
[1] 447.4191
mark(
  sum(x, na.rm = TRUE),
  sum2_cpp(x, na_rm = TRUE)
)
# A tibble: 2 × 6
  expression                     min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>                <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 sum(x, na.rm = TRUE)         725ns    785ns  1184974.        0B     0   
2 sum2_cpp(x, na_rm = TRUE)     10µs   10.6µs    91536.        0B     9.15

Arithmetic Mean (mean())

The next function calculates the mean of the elements of a vector:

[[cpp11::register]] double mean2_cpp_(doubles x, bool na_rm = false) {
  int n = x.size();
  int m = 0;
  
  for (int i = 0; i < n; ++i) {
    if (na_rm && ISNAN(x[i])) {
      continue;
    } else {
      ++m;
    }
  }

  if (m == 0) {
    return NA_REAL;
  }

  double total = 0;
  for (int i = 0; i < n; ++i) {
    if (na_rm && ISNAN(x[i])) {
      continue;
    } else {
      total += x[i];
    }
  }

  return total / m;
}

Unlike the mean function from Chapter 3, this function has an additional argument na_rm that allows the user to remove missing values (including NaN) from the vector.

The corresponding auxiliary function for documentation is:

#' Return the mean of the elements of a vector (C++)
#' @inheritParams sum2_cpp
#' @export
mean2_cpp <- function(x, na_rm = FALSE) {
  mean2_cpp_(as.double(x), na_rm = na_rm)
}

To test the functions, I ran the following benchmark code in the R console:

mean(x)
[1] NA
mean2_cpp(x)
[1] NA
mean(x, na.rm = TRUE)
[1] 0.4971323
mean2_cpp(x, na_rm = TRUE)
[1] 0.4971323
mark(
  mean(x, na.rm = TRUE),
  mean2_cpp(x, na_rm = TRUE)
)
# A tibble: 2 × 6
  expression                      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>                 <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 mean(x, na.rm = TRUE)        5.72µs    7.1µs   116334.    22.5KB     58.2
2 mean2_cpp(x, na_rm = TRUE)  15.54µs   16.7µs    58035.        0B      0  

Variance (var())

The next function calculates the variance of a vector:

[[cpp11::register]] double var2_cpp_(doubles x, bool na_rm = false) {
  int n = x.size();
  int m = 0;
  double total = 0, sq_total = 0;

  for (int i = 0; i < n; ++i) {
    if (na_rm && ISNAN(x[i])) {
      continue;
    } else {
      ++m;
      total += x[i];
      sq_total += pow(x[i], 2);
    }
  }

  if (m <= 1) {
    return NA_REAL;
  }

  return (sq_total - total * total / m) / (m - 1);
}

Unlike the variance function from Chapter 3, this function has an additional argument na_rm that allows the user to remove missing values (including NaN) from the vector.

The corresponding auxiliary function for documentation is:

#' Return the variance of the elements of a vector (C++)
#' @inheritParams sum2_cpp
#' @export
var2_cpp <- function(x, na_rm = FALSE) {
  var2_cpp_(as.double(x), na_rm = na_rm)
}

To test the functions, I ran the following benchmark code in the R console:

var(x)
[1] NA
var2_cpp(x)
[1] NA
var(x, na.rm = TRUE)
[1] 0.08155043
var2_cpp(x, na_rm = TRUE)
[1] 0.08155043
mark(
  var(x, na.rm = TRUE),
  var2_cpp(x, na_rm = TRUE)
)
# A tibble: 2 × 6
  expression                     min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>                <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 var(x, na.rm = TRUE)        6.01µs   7.49µs   126523.    3.95KB    25.3 
2 var2_cpp(x, na_rm = TRUE)  38.56µs  39.72µs    24815.        0B     2.48

Root Mean Square Error (RMSE)

The next function calculates the root mean square error between observed values (\(x\)) and the true value (\(x_0\)):

[[cpp11::register]] double rmse2_cpp_(doubles x, double x0, bool na_rm = false) {
  int n = x.size();
  int m = 0;
  double total = 0;

  for (int i = 0; i < n; ++i) {
    if (na_rm && ISNAN(x[i])) {
      continue;
    } else {
      ++m;
      total += pow(x[i] - x0, 2);
    }
  }

  if (m == 0) {
    return NA_REAL;
  }

  return sqrt(total / m);
}

Unlike the RMSE function from Chapter 3, this function has an additional argument na_rm that allows the user to remove missing values (including NaN) from the vector.

The corresponding auxiliary function for documentation is:

#' Return the root mean square error (C++)
#' @inheritParams rmse_r
#' @param na_rm logical. Should missing values (including `NaN`) be removed?
#' @export
rmse2_cpp <- function(x, x0, na_rm = FALSE) {
  rmse2_cpp_(as.double(x), as.double(x0), na_rm = na_rm)
}

To test the functions, I ran the following benchmark code in the R console:

# create a list with 100 normal distributions with mean 0 and 1,000 elements each
set.seed(123)
x <- list()
for (i in 1:1e3) {
  x[[i]] <- rnorm(1e3)
}

# compute the mean of each distribution
x <- sapply(x, mean)

# insert NA values at random
x[sample(1:1e3, 1e2)] <- NA

rmse2_cpp(x, 0)
[1] NA
rmse2_cpp(x, 0, na_rm = TRUE)
[1] 0.02992032
mark(
  sqrt(mean((x - 0)^2, na.rm = TRUE)),
  rmse2_cpp(x, 0, na_rm = TRUE)
)
# A tibble: 2 × 6
  expression                            min  median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
1 sqrt(mean((x - 0)^2, na.rm = TRU…  6.96µs  8.26µs   114492.    30.4KB     68.7
2 rmse2_cpp(x, 0, na_rm = TRUE)      27.7µs  30.6µs    31861.        0B      0  

References